[gpt-oss] triton kernel mxfp4 #22421
Conversation
Signed-off-by: <zyy1102000@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request introduces support for mxfp4 quantization on Hopper GPUs by integrating a new Triton kernel for MoE layers. The changes include adding the kernel wrappers, modifying the mxfp4 quantization path to use it, and adding corresponding tests. The implementation looks solid, but I have two high-level concerns. First, the number of warps for the Triton kernel is configured statically based on an environment variable, which might not be optimal or correct for dynamic batch sizes at runtime. Second, a utility function modifies a global configuration flag, which is a risky pattern that could lead to hard-to-debug side effects. Addressing these points would improve the robustness and maintainability of this new feature.
```python
# FIXME warp need to be adjusted based on batch size
# only apply to batched mode
if self.moe.use_ep:
    num_warps = 4 if envs.VLLM_MOE_DP_CHUNK_SIZE <= 512 else 8
else:
    num_warps = 8
```
The FIXME comment on line 301 indicates that num_warps should be adjusted based on the batch size. The current implementation determines num_warps based on the static environment variable VLLM_MOE_DP_CHUNK_SIZE, which may not reflect the dynamic batch size at runtime. This static configuration could lead to suboptimal performance or potential correctness issues if the Triton kernel has strict requirements for num_warps based on the input size. This value is used during weight loading to swizzle the weights, so it cannot be changed dynamically per batch without re-swizzling. This suggests a potential design issue that should be addressed for robust performance and correctness.
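To illustrate the distinction the review draws: the PR fixes `num_warps` once, from a static environment variable, at weight-load time. A per-batch choice would look like the hypothetical helper below (the name and threshold are illustrative, not from the PR), but it cannot simply be applied at runtime here because the chosen warp count is baked into the weight swizzle, so changing it would require re-swizzling the weights.

```python
# Hypothetical helper sketching a per-batch num_warps choice.
# In the PR this cannot be done dynamically: num_warps is fixed at
# weight-load time because the swizzled weight layout depends on it.
def pick_num_warps(num_tokens: int, threshold: int = 512) -> int:
    """Return a warp count for the Triton kernel given the batch size."""
    return 4 if num_tokens <= threshold else 8
```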
```python
if current_platform.is_cuda() and \
        current_platform.is_device_capability(100):
    constraints = {
        "is_persistent": True,
        "epilogue_subtile": 1,
    }
    opt_flags.update_opt_flags_constraints(constraints)
```
The function _swizzle_mxfp4 modifies a global state by calling opt_flags.update_opt_flags_constraints(constraints). Modifying global state within a utility function is a dangerous pattern as it can introduce non-local side effects that are difficult to debug, especially in a system that might handle multiple models or requests concurrently. This could cause issues if different models or layers have conflicting requirements for these optimization flags. It would be safer to manage this global state with more care, for example, by using a context manager to set and restore the flags, or by passing constraints as parameters to the underlying kernel if the API supports it.
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
```python
def has_triton_kernels() -> bool:
    """Whether the optional `triton_kernels` package is available."""
    return _has_module("triton_kernels")
```
QQ: How can I install this?
We need to install directly from the triton repo:

```shell
uv pip install triton/python/triton_kernels --no-deps
```

There's no PyPI wheel yet.
Hmm, this broke the trunk.
Are you running …? You need to install it from the triton repo:

```shell
git clone https://github.com/triton-lang/triton
uv pip install triton/python/triton_kernels --no-deps
```
Pushed a fix: #22529
Just FYI, the error shows up on the llama4 benchmark run (https://github.com/pytorch/pytorch-integration-testing/actions/runs/16834994069/job/47692144587#step:14:3962), so it affects other models too.
Yeah, I was running DeepSeek. The code path is shared. Thanks for the quick fix!
Signed-off-by: <zyy1102000@gmail.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: <zyy1102000@gmail.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Noam Gat <noamgat@gmail.com>
Hi @zyongye, can we use that kernel on Blackwell? If so, could you provide the Triton commit? I encountered the following issue when running the unit tests locally.
Hi @yiliu30, for Blackwell (SM100) we have kernels from FlashInfer available; please see the recipe for details: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#b200
Signed-off-by: <zyy1102000@gmail.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
Signed-off-by: <zyy1102000@gmail.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
Signed-off-by: <zyy1102000@gmail.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: <zyy1102000@gmail.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>
```python
quant_tensor = convert_layout(wrap_torch_tensor(quant_tensor, dtype=FP4),
                              value_layout, **value_layout_opts)
scale = convert_layout(wrap_torch_tensor(scale), scale_layout,
                       **scale_layout_opts)
return quant_tensor, InFlexData(), scale
```
Is it safe to unwrap from triton_kernels.tensor.Tensor from here? Could we avoid it in the first place?
This is a util function from triton_kernels. It is designed to take a triton_kernels.Tensor instead of a torch.Tensor.
```python
del layer.w2_weight
layer.w13_weight = None
layer.w2_weight = None
torch.cuda.empty_cache()
```
Needs nightly torch and triton main to work.
Don't merge; waiting for the accuracy test.